
    Adaptively Parallel Processor Allocation for Cilk Jobs

    The problem of allocating processor resources fairly and efficiently to parallel jobs has been studied extensively in the past. Most of this work, however, assumes that the instantaneous parallelism of the jobs is known and used by the job scheduler to make its decisions. In this project, we consider different ways of estimating the parallelism of jobs at runtime, as well as different ways of using this information to allocate processors in a fair and efficient manner.

    The goal of our project is to design and implement a dynamic processor-allocation system for adaptively parallel jobs: jobs for which the number of processors that can be used without waste—in other words, the parallelism of each job—varies during execution. We call the problem of allocating processors to adaptively parallel jobs the adaptively parallel processor-allocation problem. We propose to investigate this problem for jobs running on a shared-memory multiprocessor (SMP) system, focusing on the specific case of parallel jobs scheduled with the randomized work-stealing algorithm, as used in the Cilk multithreaded language (later, we will expand the scope of our research to include other kinds of parallel jobs).

    Our problem can be defined as follows. Consider an SMP system with P processors and J jobs. At any given time, each job j has a desire d_j, representing the maximum number of efficiently usable processors, and an allotment m_j, representing the number of processors allocated to it. Our problem is to design a processor-allocation system that achieves a fair and efficient allocation of processors among these jobs, where "fair" and "efficient" are defined as follows: an allocation is fair if, whenever a job receives fewer processors than it desires, no other job receives more than one more processor than this job received (the allowance of one processor is due to integer roundoff); an allocation is efficient if no job receives more processors than it desires and, if some job receives fewer processors than it desires, all P processors are in use.

    The successful design of such a system can contribute significantly to our understanding of the adaptively parallel processor-allocation problem. If the task of estimating processor desires is delegated to the jobs themselves—and if algorithms are devised to perform this estimation for different types of parallel jobs—then the system can easily be generalized into a core scheduling service provided by the kernel. In addition, the system can be extended to respect the job-priority mechanism supported by the kernel, or to take processor-usage histories into account when making allocation decisions. Moreover, we can consider other ways of characterizing the fairness and efficiency of processor allocations beyond the definitions given above.

    Currently, the prototype of the processor-allocation system we are developing is a user-mode extension to Cilk that uses the steal rate of Cilk jobs to estimate processor desires and a randomized, pairwise exchange of processors to achieve a fair and efficient allocation. The prototype has a well-defined interface and a modular architecture, allowing us to investigate different desire-estimation and processor-allocation algorithms without too much difficulty.

    Singapore-MIT Alliance (SMA)
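    To make the fairness and efficiency definitions concrete, the following sketch (ours, not the prototype's code) computes allotments by handing out processors one at a time to the unsatisfied job with the smallest current allotment; function and variable names are illustrative.

```python
# A minimal sketch of a "fair and efficient" allotment in the abstract's
# sense: no job exceeds its desire d_j, all P processors are used whenever
# some job is unsatisfied, and any two jobs that received less than their
# desire differ by at most one processor.
def allocate(desires, P):
    """desires: list of per-job desires d_j; returns allotments m_j."""
    m = [0] * len(desires)
    remaining = P
    while remaining > 0:
        # Jobs still below their desire, preferring the smallest allotment.
        hungry = [j for j in range(len(desires)) if m[j] < desires[j]]
        if not hungry:
            break  # every job is satisfied; leftover processors stay idle
        j = min(hungry, key=lambda j: m[j])
        m[j] += 1
        remaining -= 1
    return m

# Example: P = 8 processors, desires (5, 1, 4) -> allotments [4, 1, 3]:
# both unsatisfied jobs are within one processor of each other, and all
# 8 processors are in use.
print(allocate([5, 1, 4], 8))
```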

    Hierarchical Scheduling for Multicores with Multilevel Cache Hierarchies

    Cache locality is an important consideration for performance in multicore systems. In modern and future multicore systems with multilevel cache hierarchies, caches may be arranged in a tree, where a level-k cache is shared between P_k processors (called a processor group) and P_k increases with k. To get good performance, subcomputations that share more data should, as much as possible, execute on processors that share a lower-level cache. The number of cache misses in these systems therefore depends on the scheduling decisions, and a scheduler is responsible not just for achieving good load balance and low overheads, but also good cache complexity. These can, however, be competing criteria. In this paper, we explore the tension between these criteria for online hierarchical schedulers.

    Formally, we consider a system with P processors arranged in a multilevel hierarchy according to a hierarchy tree, where each of the P processors forms a leaf of the tree and an internal node at level k corresponds to a processor group. In addition, we assume that computations have locality regions, which represent parallel subcomputations that share data. Each locality region has a particular level, and the scheduler must ensure that a level-k locality region is executed by processors in the same level-k processor group, since they share a level-k cache. Locality regions can thus improve cache performance; however, they may also impair load balance and increase scheduling overheads, since the scheduler must obey the restrictions they pose.

    In this paper, we present a framework of hierarchical computations, that is, computations with locality regions at multiple levels of nesting. We describe the hierarchical greedy scheduler, in which each locality region is scheduled using a greedy scheduler that attempts to use as many processors as possible while obeying the restrictions posed by the locality regions, and we derive a recurrence for the time complexity of a region in terms of its nested regions. We also describe how a more realistic hierarchical work-stealing scheduler can achieve the same bounds up to constant factors for an important subclass of computations called homogeneous computations. Finally, we analyze the cache complexity of the hierarchical work-stealing scheduler for a system with a multilevel cache hierarchy.
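    As an illustration of the placement constraint, here is a sketch of ours (not the paper's framework) assuming a uniform hierarchy in which each level-k group consists of group_size[k] consecutive processors; it checks whether an assignment of processors to a level-k locality region stays within a single level-k group.

```python
# A minimal sketch, assuming a uniform hierarchy: every level-k group
# contains group_size[k] consecutive processors. It checks the paper's
# constraint that a level-k locality region may only run on processors
# that share a level-k cache, i.e., that lie in one level-k group.
def same_group(p, q, level, group_size):
    """True if processors p and q belong to the same level-`level` group."""
    size = group_size[level]          # P_k: processors per level-k group
    return p // size == q // size

def region_ok(procs, level, group_size):
    """All processors assigned to a level-`level` region share a group."""
    return all(same_group(procs[0], q, level, group_size) for q in procs)

# Example: 8 processors; level-1 groups of 2 (shared L2), level-2 groups
# of 4 (shared L3); group_size[0] = 1 models private level-0 caches.
group_size = [1, 2, 4]
print(region_ok([4, 5], 1, group_size))   # True: same level-1 group
print(region_ok([3, 4], 2, group_size))   # False: spans two level-2 groups
```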

    A Measurement-Based Model for Parallel Real-Time Tasks

    Under the federated paradigm of multiprocessor scheduling, a set of processors is reserved for the exclusive use of each real-time task. If tasks are characterized very conservatively (as is typical in safety-critical systems), most invocations of a task are likely to have computational demand far below the worst-case characterization, and could have been scheduled correctly upon far fewer processors than were assigned to the task under that worst-case characterization. Provided we could safely determine at run-time when all the processors are going to be needed, the unneeded processors could, for the rest of the time, be idled in a low-energy "sleep" mode or used to execute non-real-time work in the background. In this paper we propose a model for representing parallelizable real-time tasks in a manner that permits us to do so. Our model does not require fine-grained knowledge of the internal structure of the code represented by the task; rather, it characterizes each task by a few parameters that are obtained by repeatedly executing the code under different conditions and measuring the run-times.
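    A minimal measurement harness in the spirit of this approach might look as follows; the paper's actual parameters are not reproduced here, and run_task is a hypothetical hook that executes one invocation of the task on m processors.

```python
# A sketch (ours, not the paper's model): run the task repeatedly on m
# processors for several values of m and record the worst observed
# run-time for each m, as a conservative measurement-based estimate.
import time

def profile(run_task, proc_counts, trials=100):
    """Return {m: worst observed run-time} over `trials` invocations."""
    worst = {}
    for m in proc_counts:
        times = []
        for _ in range(trials):
            start = time.perf_counter()
            run_task(m)                      # one invocation on m processors
            times.append(time.perf_counter() - start)
        worst[m] = max(times)                # conservative per-m estimate
    return worst
```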

    Intractability Issues in Mixed-Criticality Scheduling

    In seeking to develop mixed-criticality scheduling algorithms, one encounters challenges arising from two sources. First, mixed-criticality scheduling is inherently an on-line problem, in that scheduling decisions must be made without access to all the information needed to make them optimally - such information is only revealed over time. Second, many fundamental mixed-criticality schedulability-analysis problems are computationally intractable - NP-hard in the strong sense - yet we desire to solve these problems using algorithms with polynomial or pseudo-polynomial running time. While these two aspects of intractability are traditionally studied separately in the theoretical computer science literature, they have been considered in an integrated fashion in mixed-criticality scheduling theory. In this work we seek to separate out the effects of being inherently on-line, and of being computationally intractable, on the overall intractability of mixed-criticality scheduling problems. The speedup factor is widely used as a quantitative metric of the effectiveness of mixed-criticality scheduling algorithms; there has recently been some debate regarding the appropriateness of doing so. We provide additional perspective on this matter: we seek to better understand both its appropriateness and its limitations by examining separately how the on-line nature of some mixed-criticality problems, and their computational complexity, contribute to the speedup factors of two widely-studied mixed-criticality scheduling algorithms.
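    For reference, the speedup factor of a scheduling algorithm A is commonly defined as follows in this literature (the notation here is ours, not the paper's):

```latex
% Standard definition: algorithm A has speedup factor s >= 1 if every
% task system that is feasible on unit-speed processors is correctly
% scheduled by A on processors that are s times as fast.
\[
  \mathrm{speedup}(A) \;=\;
  \inf \{\, s \ge 1 \;:\; \forall \tau,\;
    \mathrm{feasible}(\tau, 1) \Rightarrow
    \mathrm{schedulable}_{A}(\tau, s) \,\}
\]
```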

    The Safe and Effective Use of Learning-Enabled Components in Safety-Critical Systems

    Autonomous systems increasingly use components that incorporate machine learning and other AI-based techniques in order to achieve improved performance. The problem of assuring correctness in safety-critical systems that use such components is considered. A model is proposed in which components are characterized according to both their worst-case and their typical behaviors; it is argued that while safety must be assured under all circumstances, it is reasonable to be concerned with providing a high degree of performance for typical behaviors only. The problem of assuring safety while providing such improved performance is formulated as an optimization problem in which performance under typical circumstances is the objective function to be optimized, while safety is a hard constraint that must be satisfied. Algorithmic techniques are applied to derive an optimal solution to this optimization problem. This optimal solution is compared, via simulation experiments on synthetically generated workloads, with an alternative approach that optimizes for performance under worst-case conditions, as well as with some common-sense heuristics.
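    Schematically, the formulation described above can be written as follows (the symbols are ours, not the paper's; pi ranges over candidate policies for the system):

```latex
% A schematic statement of the optimization problem: maximize
% typical-case performance subject to worst-case safety as a hard
% constraint.
\[
  \max_{\pi}\; \mathrm{Perf}_{\mathrm{typical}}(\pi)
  \qquad \text{subject to} \qquad
  \mathrm{Safe}_{\mathrm{worst\text{-}case}}(\pi)
\]
```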

    Scheduling and synchronization for multicore concurrency platforms

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 217-230).

    Developing correct and efficient parallel programs is difficult, since programmers often have to manage low-level details like scheduling and synchronization explicitly. Recently, however, many hardware vendors have been shifting towards building multicore computers. This trend creates enormous pressure to create concurrency platforms - platforms that provide an easier interface for parallel programming and enable ordinary programmers to write scalable, portable, and efficient parallel programs. This thesis provides some provably-good practical solutions to problems that arise in the implementation of concurrency platforms, particularly in the domain of scheduling and synchronization.

    The first part of this thesis describes work on scheduling of parallel programs written in dynamic multithreaded languages (such as Cilk and Hood). These languages allow the programmer to express the parallelism of their code in a natural manner, while an automatic scheduler in the concurrency platform is responsible for scheduling the program on the underlying parallel hardware. This thesis presents designs to increase the functionality of these concurrency platforms.

    The second part of the thesis presents work on transactional memory semantics and design. Transactional memory (TM) has recently been proposed as an alternative to locks. TM provides a transactional interface to memory: programmers can specify their critical sections inside a transaction, and the TM concurrency platform guarantees that the region executes atomically. One of the purported advantages of TM over locks is that transactional code is composable. Most current TM concurrency platforms do not support full composability, however. This thesis addresses two of the composability problems in existing TM concurrency platforms.

    by Kunal Agrawal. Ph.D.
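    The composability point can be illustrated with a classical example (ours, not the thesis's code): two individually thread-safe accounts whose composed transfer operation can deadlock under locks, whereas a TM system would wrap the whole body in one transaction that nests and composes safely.

```python
# An illustration of why lock-based code composes poorly: each account
# is individually thread-safe, but the composed `transfer` can deadlock
# when two threads acquire the same pair of locks in opposite orders.
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    with src.lock:            # thread 1: transfer(a, b) locks a, then b
        with dst.lock:        # thread 2: transfer(b, a) locks b, then a
            src.balance -= amount   # -> the two threads can deadlock
            dst.balance += amount
```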

    Locality-Aware Dynamic Task Graph Scheduling

    Dynamic task graph schedulers automatically balance work across processor cores by scheduling tasks among available threads while preserving dependences. In this paper, we design NabbitC, a provably efficient dynamic task graph scheduler that accounts for data locality on NUMA systems. NabbitC allows users to assign a color to each task representing the location (e.g., a processor core) that has the most efficient access to the data needed during that node's execution. NabbitC then automatically adjusts the scheduling so as to preferentially execute each node at the location that matches its color, leading to better locality because the node is likely to make local rather than remote accesses. At the same time, NabbitC tries to optimize load balance and not add too much overhead compared to the vanilla Nabbit scheduler, which does not consider locality. We provide a theoretical analysis showing that NabbitC does not asymptotically impact the scalability of Nabbit. We evaluated the performance of NabbitC on a suite of memory-intensive benchmarks. Our experiments indicate that adding locality awareness yields a considerable performance advantage over the vanilla Nabbit scheduler. In addition, we compared NabbitC to OpenMP programs for both regular and irregular applications. For regular applications, OpenMP achieves perfect locality and perfect load balance statically; for these benchmarks, NabbitC has a small performance penalty compared to OpenMP due to its dynamic scheduling strategy. For irregular applications, where OpenMP cannot achieve locality and load balance simultaneously, we find that NabbitC performs better. NabbitC therefore combines the benefits of locality-aware scheduling for regular applications (the forte of static schedulers such as those in OpenMP) with dynamic adaptation to load imbalance (the forte of dynamic schedulers such as Cilk Plus, TBB, and Nabbit).
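    One way to picture color-guided scheduling is a set of per-location queues with stealing as a fallback; this is a sketch of ours, and NabbitC's actual data structures are not reproduced here.

```python
# A minimal sketch of color-guided scheduling: each ready task carries a
# color naming its preferred location; a worker prefers tasks of its own
# color (good locality) and falls back to taking other colors' tasks to
# preserve load balance.
from collections import deque

class ColorScheduler:
    def __init__(self, locations):
        self.queues = {loc: deque() for loc in locations}

    def push(self, task, color):
        self.queues[color].append(task)      # enqueue at preferred location

    def pop(self, my_color):
        if self.queues[my_color]:            # local work first: locality
            return self.queues[my_color].popleft()
        for color, q in self.queues.items(): # otherwise steal: load balance
            if q:
                return q.popleft()
        return None
```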

    Reservation-Based Federated Scheduling for Parallel Real-Time Tasks

    This paper considers the scheduling of parallel real-time tasks with arbitrary deadlines. Each job of a parallel task is described as a directed acyclic graph (DAG). In contrast to prior work in this area, where decomposition-based scheduling algorithms are proposed based on the DAG structure and inter-task interference is analyzed as self-suspending behavior, this paper generalizes the federated scheduling approach. We propose a reservation-based algorithm, called reservation-based federated scheduling, that dominates federated scheduling. We provide general constraints for the design of such systems and prove that reservation-based federated scheduling has a constant speedup factor with respect to any optimal DAG task scheduler. Furthermore, the presented algorithm can be used in conjunction with any scheduler and scheduling analysis suitable for ordinary arbitrary-deadline sporadic task sets, i.e., without parallelism.
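    As background, classical federated scheduling (which this paper generalizes) grants each high-utilization task dedicated processors sized from its total work and critical-path length; a commonly cited allocation from that literature, in our notation and not this paper's reservation design, is:

```latex
% Background from the classical federated-scheduling literature: a
% high-utilization DAG task with total work C_i, critical-path length
% L_i, and deadline D_i > L_i is granted m_i dedicated processors, where
\[
  m_i \;=\; \left\lceil \frac{C_i - L_i}{D_i - L_i} \right\rceil .
\]
% The reservation-based variant replaces dedicated processors with
% reservation budgets that any suitable scheduler may supply.
```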